AI Project 4 - Mohsen Amjadi 810896043

In this project we build several prediction models to predict the sale price of each house and evaluate how accurate each trained model is. To that end we use machine learning techniques, data analysis, and preprocessing, both to understand the data better and to build better models.

Load and check the data

Verify NaN value:

The first thing to do is get rid of the features with more than 80% missing values (figure below). For example, PoolQC's missing values are probably due to the lack of a pool in most buildings, which is logical. But replacing those (more than 80%) missing values with "no pool" would leave us with a low-variance feature, and low-variance features are uninformative for machine learning models. So we drop the features with more than 80% missing values.

PS: In this version, I lowered the threshold to 20% to drop more columns.
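A minimal sketch of the drop step, using a toy DataFrame in place of the real housing data (the values are illustrative; only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data; PoolQC is mostly
# missing, just as in the real dataset.
df = pd.DataFrame({
    "PoolQC":    [np.nan, np.nan, np.nan, np.nan, "Gd"],
    "LotArea":   [8450, 9600, 11250, 9550, 14260],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

threshold = 0.20                   # lowered from 0.80 in this version
missing_ratio = df.isna().mean()   # fraction of NaNs per column
to_drop = missing_ratio[missing_ratio > threshold].index
df = df.drop(columns=to_drop)      # PoolQC (80% missing) is removed
```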

Merge test and train data

Replace NaN values

One column corresponds to the year the garage was built; it is NaN when there is no garage. These columns will be transformed into categorical variables.
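For example, the GarageYrBlt step can be sketched like this (toy data; the "NoGarage" label is an assumed placeholder, not necessarily the one used in the notebook):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"GarageYrBlt": [2003.0, 1976.0, np.nan, 1998.0]})

# NaN here means "no garage", so encode it explicitly and treat the
# column as categorical rather than numeric.
df["GarageYrBlt"] = (
    df["GarageYrBlt"]
    .fillna("NoGarage")   # assumed placeholder label
    .astype(str)          # make all categories the same type
    .astype("category")
)
```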

Verify dtypes:

Feature engineering

Collapsing data into categories

GarageYrBlt: int to categorical

Verify value consistency

Collapsing too many categories into a few

Making sure all the relevant data are categorical variables.

Analysing target variable: 'SalePrice'

Check the asymmetry of the probability distribution

Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.

We can see:

Log transform skewed numeric features:

We want our skewness value to be around 0 and kurtosis less than 3.

Here are two examples of skewed features: ground living area and first-floor square footage. We will apply np.log1p to the skewed variables.

Now, the skew seems corrected and the data appears more normally distributed.
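The effect can be reproduced on synthetic data; here a lognormal sample stands in for the right-skewed SalePrice distribution:

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(0)
# Lognormal sample standing in for the skewed target variable.
sale_price = rng.lognormal(mean=12, sigma=0.5, size=5000)

raw_skew = skew(sale_price)                    # strongly positive
raw_kurt = kurtosis(sale_price, fisher=False)  # well above 3 (heavy tail)

log_skew = skew(np.log1p(sale_price))          # close to 0 after log1p
```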

Skewed features:

Box Cox Transformation of (highly) skewed features

We use the scipy function boxcox1p, which computes the Box-Cox transformation of 1+x.

Note that setting λ=0 is equivalent to the log1p transform used above for the target variable.
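A small sketch of the transform, with λ=0.15 as an illustrative value (the notebook's actual λ may differ); it also checks the λ=0 equivalence with log1p:

```python
import numpy as np
from scipy.special import boxcox1p
from scipy.stats import skew

rng = np.random.default_rng(0)
feature = rng.lognormal(mean=6, sigma=0.8, size=2000)  # highly skewed

lam = 0.15  # illustrative lambda; lam=0 reduces boxcox1p to log1p
transformed = boxcox1p(feature, lam)
# skew(transformed) is far closer to 0 than skew(feature)

# lam=0 gives exactly log1p:
same_as_log1p = np.allclose(boxcox1p(feature, 0.0), np.log1p(feature))
```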

Numerical values

Correlations between variables: heatmap

Observation

As we can see, multicollinearity still exists among several features. However, we will keep them for now for the sake of learning. Let's go through some of the correlations that still exist.

There is 0.83 (83%) correlation between GarageYrBlt and YearBuilt, 83% correlation between TotRmsAbvGrd and GrLivArea, and 89% correlation between GarageCars and GarageArea. Similarly, many other features such as BsmtUnfSF and FullBath are well correlated with other independent features. If I were using only multiple linear regression, I would delete these features from the dataset to fit the model better. However, we will be using many algorithms, as scikit-learn makes them easy to implement, so we will keep all the features for now.
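The kind of multicollinearity described above can be checked with DataFrame.corr(); the synthetic columns below only mimic the GarageArea/GarageCars relationship and are not the real data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
garage_area = rng.normal(500.0, 100.0, 300)
df = pd.DataFrame({
    "GarageArea": garage_area,
    # GarageCars is essentially GarageArea plus noise, hence the high correlation.
    "GarageCars": np.round(garage_area / 250.0 + rng.normal(0.0, 0.2, 300)),
    "SalePrice":  garage_area * 200.0 + rng.normal(0.0, 20000.0, 300),
})

corr = df.corr()
# sns.heatmap(corr, annot=True) would render this matrix as the heatmap.
```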

plot correlations and residuals

Here, we see that the charts on the right show homoscedasticity (a roughly equal amount of variance across the zero line). This is because we transformed the target variable using numpy.log1p. Otherwise, the residual plot would show the variance increasing along with the target variable (heteroscedasticity).

Outliers detection:

What is an outlier exactly? It’s a data point that is significantly different from other data points in a data set.

Looking at variables together can help you spot common-sense outliers. Say a study is using both people's ages and marital status to draw conclusions. If you look at the variables separately, you might miss outliers. For example, "12 years old" isn't an outlier and "widow" isn't an outlier, but we know that a 12-year-old widow is likely an outlier.

For multivariate data, scatterplots can be very effective. Scatterplots show a collection of data points, where the x-axis (horizontal) represents the independent variable and the y-axis (vertical) represents the dependent variable.

Categorical variables

split the df with all data into the training and testing set

MODEL

Preprocessing

Split the data

Scale the data

We use RobustScaler to scale the data because it is robust to outliers; we already detected some, but there are probably others left.
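A minimal illustration of why RobustScaler resists outliers: it centers on the median and scales by the interquartile range, so one extreme value does not distort the bulk of the data (toy numbers):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # one extreme outlier

scaler = RobustScaler()          # median/IQR instead of mean/std
X_scaled = scaler.fit_transform(X)
# The median row (3.0) maps to 0; the outlier stays far away
# but does not shift the scaling of the ordinary points.
```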

Linear regression

GridSearchCV()

Best fit

Evaluation Metrics

Here I use three evaluation metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). The MSE in our problem is very large and hard to interpret directly. As for MAE and RMSE:

$$ \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat y_i| $$
$$ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat y_i - y_i)^2} $$

Differences: Taking the square root of the average squared errors has some interesting implications for RMSE. Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors. This means the RMSE should be more useful when large errors are particularly undesirable. The three tables below show examples where MAE is steady and RMSE increases as the variance associated with the frequency distribution of error magnitudes also increases.
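Both metrics are available in scikit-learn; a small worked example with made-up prices shows RMSE sitting above MAE:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up prices for illustration only.
y_true = np.array([200000.0, 150000.0, 300000.0, 250000.0])
y_pred = np.array([210000.0, 140000.0, 280000.0, 260000.0])

mae = mean_absolute_error(y_true, y_pred)           # 12500.0
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # ~13229, above MAE
# The single larger error (20000) pulls RMSE up more than MAE.
```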

Linear Regression

The simplest way of training a regression model is the Linear Regression algorithm, which has no specific hyperparameters to tune. LinearRegression fits a linear model with coefficients that minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation.

K Nearest Neighbors Regression

Neighbors-based regression can be used in cases where the data labels are continuous rather than discrete variables. The label assigned to a query point is computed based on the mean of the labels of its nearest neighbors. Here we tune the n_neighbors parameter to find the best number of neighbors for our model.
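A sketch of that tuning step with GridSearchCV on synthetic data (the grid values and dataset are illustrative, not the notebook's actual ones):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression problem standing in for the housing data.
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# Cross-validated search over candidate neighbor counts.
grid = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11]},
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)
best_k = grid.best_params_["n_neighbors"]
```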

Decision Tree Regression

Decision Trees are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. Here, I tune the max_depth parameter to find the best depth for the model. As the results show, beyond a depth of 3 the data get overfitted, so I use 3 as the best depth for our model.
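The overfitting pattern can be reproduced on synthetic data by comparing train and test scores across depths (the dataset here is illustrative; the notebook's crossover depth of 3 need not match):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# As depth grows, the train score keeps rising while the test score
# stalls or drops: the gap between the two is the overfitting signal.
for depth in [2, 3, 5, 10, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(depth, tree.score(X_train, y_train), tree.score(X_test, y_test))
```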

Overfitting in Machine Learning

Overfitting refers to a model that models the training data too well.

Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the model's ability to generalize.

Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how much detail the model learns.

For example, decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting training data. This problem can be addressed by pruning a tree after it has learned in order to remove some of the detail it has picked up.

Underfitting in Machine Learning

Underfitting refers to a model that can neither model the training data nor generalize to new data.

An underfit machine learning model is not a suitable model and will be obvious as it will have poor performance on the training data.

Underfitting is often not discussed as it is easy to detect given a good performance metric. The remedy is to move on and try alternate machine learning algorithms. Nevertheless, it does provide a good contrast to the problem of overfitting.

How To Limit Overfitting

Both overfitting and underfitting can lead to poor model performance. But by far the most common problem in applied machine learning is overfitting.

Overfitting is such a problem because the evaluation of machine learning algorithms on training data is different from the evaluation we actually care the most about, namely how well the algorithm performs on unseen data.

There are two important techniques that you can use when evaluating machine learning algorithms to limit overfitting: resampling (such as k-fold cross validation) and holding back a validation dataset.

The most popular resampling technique is k-fold cross validation. It allows you to train and test your model k-times on different subsets of training data and build up an estimate of the performance of a machine learning model on unseen data.

A validation dataset is simply a subset of your training data that you hold back from your machine learning algorithms until the very end of your project. After you have selected and tuned your machine learning algorithms on your training dataset you can evaluate the learned models on the validation dataset to get a final objective idea of how the models might perform on unseen data.

Using cross validation is a gold standard in applied machine learning for estimating model accuracy on unseen data.
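A minimal example of k-fold cross validation with scikit-learn (synthetic data, k=5; the estimator is just an illustration):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# 5-fold CV: fit and score on 5 different train/test splits,
# then average the scores to estimate performance on unseen data.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
mean_r2 = scores.mean()
```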

Random Forest Regression

In random forests, each tree in the ensemble is built from a sample drawn with replacement (i.e., a bootstrap sample) from the training set. Furthermore, when splitting each node during the construction of a tree, the best split is found either from all input features or from a random subset of size max_features. Here, I tune the max_depth parameter to find the best depth. As the results show, deeper trees keep improving the score, so I use 20 as the best depth for our model. Given more time, I would search a wider range of depths.
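A sketch of fitting the forest with max_depth=20 on synthetic data (n_estimators and the dataset are illustrative assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=5, noise=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree sees a bootstrap sample of the training set; each split
# considers a random subset of the features (max_features).
forest = RandomForestRegressor(n_estimators=100, max_depth=20, random_state=0)
forest.fit(X_train, y_train)
test_r2 = forest.score(X_test, y_test)
```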

DecisionTreeRegressor

Gridsearch

run DecisionTreeRegressor with best score

RandomForestRegressor

Gridsearch

run RandomForestRegressor with best score